Abstract: This article addresses the problem of classification based on both labeled
and unlabeled data, where we assume that the density function for labeled data is
different from that for unlabeled data. We propose a semi-supervised logistic
regression model for classification problems along with the technique of covariate
shift adaptation. Unknown parameters involved in the proposed models are estimated by
regularization with the EM algorithm. A crucial issue in the modeling process is the
choice of tuning parameters in our semi-supervised logistic models. In order to select
the parameters, a model selection criterion is derived from an information-theoretic
approach. Some numerical studies show that our modeling procedure performs well
in various cases.

Key Words and Phrases: Covariate shift, EM algorithm, Model selection, Reg- 
ularization, Semi-supervised learning. 

1 Introduction 

In recent years, with the wide availability of fast and powerful computers,
high-throughput data of unprecedented size and complexity have frequently appeared in
contemporary statistics and machine learning. Examples include data from genomics,
proteomics, natural language processing, and signal processing. For such huge amounts
of data, it is difficult for a human operator to label the data, since the work
requires a vast amount of time and effort. Only a small labeled data set may,
therefore, be available, while an unlabeled data set can be obtained more easily.
Under such circumstances, classification methods that combine both labeled and
unlabeled data, called semi-supervised learning, have received an enormous amount of
attention in the recent machine learning and statistical literature (see, e.g.,
Chapelle et al., 2006; Liang et al., 2007). For overviews of semi-supervised learning
methods, we refer to Zhu (2008) and the references given therein.

Many classification techniques for semi-supervised learning have been proposed by 
various researchers, e.g., Amini and Gallinari (2002), Basu et al. (2004), Bennett and 
Demiriz (1998), Chen and Wang (2007), Dean et al. (2006), Kawano et al. (2010), 
Kawano and Konishi (2011), Lafferty and Wasserman (2007), and Zhou et al. (2004). 
Most of these semi-supervised methods implicitly assume that the density function for
labeled data is the same as that for unlabeled data. Here, in contrast, we consider
the case in which the densities for labeled and unlabeled data are different, since
the densities are not always the same in practical situations. For such a case,
several semi-supervised methods have been presented, e.g., Jiang and Zhai (2007),
Wu et al. (2009), and Zadrozny (2004). However, for these methods, there remains the
problem of evaluating the constructed semi-supervised models, which is a crucial issue
in the model building process. Cross validation (CV) is often used in evaluating
models constructed by semi-supervised procedures. An advantage of CV lies in its
independence from probabilistic assumptions. The computational time of the procedure
is, however, very large, and the high variability and tendency to undersmooth in CV
are not negligible in the analysis of complex or high-dimensional data, since the
selectors are repeatedly applied.

In this paper, we propose a logistic model for the semi-supervised classification
problem by using statistical methods under covariate shift (Shimodaira, 2000) in the
case that the density function for labeled data is different from that for unlabeled
data. The unknown parameters in the model are estimated by the regularization method
with the help of the EM algorithm. A crucial issue in our modeling strategy is the
choice of values for the tuning parameters included in the semi-supervised logistic
models, which corresponds to evaluating the models determined by our proposed
procedures. In order to objectively select optimal values of the tuning parameters,
we introduce a model selection criterion based on an information-theoretic approach
(Konishi and Kitagawa, 1996) that evaluates semi-supervised logistic models estimated
by the regularization method. Some numerical examples demonstrate that the proposed
procedure works well and performs better than competing methods.

This paper is organized as follows. In Section 2, we present a semi-supervised logistic
model for the classification problem based on covariate shift adaptation and its
estimation by the regularization method. Section 3 provides a model selection criterion
derived from an information-theoretic viewpoint to select the tuning parameters in our
logistic models. In Section 4, Monte Carlo simulations and benchmark data analysis are
given to assess the performance of the proposed semi-supervised logistic
discrimination. Some concluding remarks are given in Section 5.

2 Semi-supervised logistic modeling from different 
sampling distributions 

2.1 Linear logistic modeling for semi-supervised learning 

We review here the semi-supervised linear logistic models developed by earlier
researchers (e.g., Amini and Gallinari, 2002; Vittaut et al., 2002). Suppose that we
have a labeled data set $\{(x_\alpha, y_\alpha); \alpha = 1, \ldots, n_1\}$ of size
$n_1$ and an unlabeled data set $\{x_\alpha; \alpha = n_1 + 1, \ldots, n\}$ of size
$n - n_1$, where $x_\alpha = (x_{\alpha 1}, \ldots, x_{\alpha p})^T$ denotes a
$p$-dimensional explanatory variable and $Y_\alpha$ is a random variable taking the
value 0 or 1 with probabilities

$$\Pr(Y_\alpha = 1 \mid x_\alpha) = \pi(x_\alpha), \qquad \Pr(Y_\alpha = 0 \mid x_\alpha) = 1 - \pi(x_\alpha). \tag{1}$$

Note that logistic models are first constructed by only the labeled data set, while the 
unlabeled data set is used in estimating the parameters involved in the logistic models. 

Using the posterior probabilities in Equation (1) and the labeled data set, a linear
logistic model (see, e.g., Hastie et al., 2009) is formulated by

$$\log\left\{\frac{\pi(x_\alpha)}{1 - \pi(x_\alpha)}\right\} = w^T x_\alpha^*, \tag{2}$$

where $w = (w_0, w_1, \ldots, w_p)^T$ is an unknown parameter vector and
$x_\alpha^* = (1, x_\alpha^T)^T$. Hereafter, we denote the posterior probabilities by
$\pi(x_\alpha; w)$, since they depend on the parameter vector $w$. It follows from
Equation (2) that the posterior probabilities can be rewritten as

$$\pi(x_\alpha; w) = \frac{\exp(w^T x_\alpha^*)}{1 + \exp(w^T x_\alpha^*)}. \tag{3}$$

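Also, the probability function of the random variable $Y_\alpha$ is the Bernoulli
distribution of the form

$$f(y_\alpha \mid x_\alpha; w) = \pi(x_\alpha; w)^{y_\alpha} \{1 - \pi(x_\alpha; w)\}^{1 - y_\alpha}, \qquad y_\alpha = 0, 1. \tag{4}$$

Under the linear logistic model, the log-likelihood for $y_\alpha$ in terms of $w$ is
given by

$$\begin{aligned}
\ell(w) &= \sum_{\alpha=1}^{n_1} \log f(y_\alpha \mid x_\alpha; w) \\
&= \sum_{\alpha=1}^{n_1} \left[ y_\alpha \log \pi(x_\alpha; w) + (1 - y_\alpha) \log\{1 - \pi(x_\alpha; w)\} \right] \\
&= \sum_{\alpha=1}^{n_1} \left[ y_\alpha w^T x_\alpha^* - \log\{1 + \exp(w^T x_\alpha^*)\} \right].
\end{aligned} \tag{5}$$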
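As a quick numerical check of Equations (3) to (5), the following minimal Python sketch
evaluates the posterior probability and the log-likelihood in both the Bernoulli form
and the simplified linear form. The function names, the parameter vector, and the toy
data are illustrative assumptions and are not part of the original study.

```python
import math

def linear_predictor(w, x):
    """Return w'x* where x* = (1, x')' and w[0] is the intercept."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def posterior(w, x):
    """Posterior probability pi(x; w) of Equation (3)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(w, x)))

def loglik_bernoulli(w, xs, ys):
    """Second line of Equation (5): sum of y log(pi) + (1 - y) log(1 - pi)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = posterior(w, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def loglik_linear(w, xs, ys):
    """Third line of Equation (5): sum of y * w'x* - log(1 + exp(w'x*))."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = linear_predictor(w, x)
        total += y * eta - math.log1p(math.exp(eta))
    return total

# Toy labeled data with p = 2 (hypothetical values).
xs = [(0.5, -1.2), (1.5, 0.3), (-0.7, 0.8), (2.0, 1.1)]
ys = [0, 1, 0, 1]
w = [-0.3, 0.9, 0.4]

print(loglik_bernoulli(w, xs, ys), loglik_linear(w, xs, ys))
```

The two printed values should agree up to floating-point rounding, which reflects the
algebraic step between the last two lines of Equation (5).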



Ordinarily, the unknown parameter $w$ included in the logistic model is estimated
by maximizing the log-likelihood function with respect to the parameter. This procedure
is known as supervised learning, i.e., the parameter is determined by using only the
labeled data set. Since we have an additional unlabeled data set, the parameter should
be estimated from both the labeled and unlabeled data sets, which is called
semi-supervised learning. To this end, Amini and Gallinari (2002) proposed a
log-likelihood function with additional unlabeled data given by

$$\ell(w) = \sum_{\alpha=1}^{n_1} \left[ y_\alpha w^T x_\alpha^* - \log\{1 + \exp(w^T x_\alpha^*)\} \right] + \sum_{\alpha=n_1+1}^{n} \left[ t_\alpha w^T x_\alpha^* - \log\{1 + \exp(w^T x_\alpha^*)\} \right], \tag{6}$$

where $t_\alpha$ ($\alpha = n_1 + 1, \ldots, n$) is a latent variable coded as 0 or 1.
Amini and Gallinari (2002) estimated the parameter by maximizing Equation (6) with the
EM algorithm, while Kawano and Konishi (2011) employed Equation (6) with a
regularization term in estimating the parameter in the context of a nonlinear logistic
model based on basis expansion.

Given the estimate $\hat{w}$, we assign a future observation $x$ to the class $j$
($j = 0, 1$) that has the maximum posterior probability in Equation (3).

2.2 Semi-supervised logistic model from different distributions 

The semi-supervised logistic models described in Section 2.1 usually assume that the
density function for the labeled data set is the same as that for the unlabeled data
set, i.e., $q_{\mathrm{label}}(x) = q_{\mathrm{unlabel}}(x)$, where
$q_{\mathrm{label}}(x)$ denotes the probability density function of the explanatory
variables for the labeled data and $q_{\mathrm{unlabel}}(x)$ that for the unlabeled
data. Our aim in this section is to construct logistic models in the situation where
the density for the labeled data set is different from that for the unlabeled data
set, i.e., $q_{\mathrm{label}}(x) \neq q_{\mathrm{unlabel}}(x)$.

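We recall the log-likelihood function for the logistic model with unlabeled data in
Equation (6). For this log-likelihood function, we propose a weighted log-likelihood
function with unlabeled data of the form

$$\begin{aligned}
\ell(w; \gamma_1, \gamma_2) &= \sum_{\alpha=1}^{n_1} \left( \frac{q_{\mathrm{unlabel}}(x_\alpha)}{q_{\mathrm{label}}(x_\alpha)} \right)^{\gamma_1} \left[ y_\alpha w^T x_\alpha^* - \log\{1 + \exp(w^T x_\alpha^*)\} \right] \\
&\quad + \sum_{\alpha=n_1+1}^{n} \left( \frac{q_{\mathrm{label}}(x_\alpha)}{q_{\mathrm{unlabel}}(x_\alpha)} \right)^{\gamma_2} \left[ t_\alpha w^T x_\alpha^* - \log\{1 + \exp(w^T x_\alpha^*)\} \right],
\end{aligned} \tag{7}$$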

where $\gamma_1, \gamma_2 \in [0, 1]$ are tuning parameters. If both $\gamma_1$ and
$\gamma_2$ are 0, the log-likelihood in Equation (7) coincides with that in Equation
(6). Note that the weight on the first term,
$q_{\mathrm{unlabel}}(x) / q_{\mathrm{label}}(x)$, is large in regions where the
density of the unlabeled data is high, while that on the second term,
$q_{\mathrm{label}}(x) / q_{\mathrm{unlabel}}(x)$, is large in regions where the
density of the labeled data is high. Hence, labeled observations lying where unlabeled
data are dense, and unlabeled observations lying where labeled data are dense,
contribute most to the log-likelihood. The idea of the weight, the ratio of
$q_{\mathrm{label}}(x)$ and $q_{\mathrm{unlabel}}(x)$, arises from statistical
inference under covariate shift (Shimodaira, 2000). In semi-supervised learning,
employing a ratio of densities in log-likelihood functions is not new. For example,
Sokolovska et al. (2008) and Zou et al. (2007) use a ratio of densities in
semi-supervised inference. However, Equation (7) is a novel formulation in the
semi-supervised context.
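To make the role of the two weights concrete, here is a minimal Python sketch of the
weighted log-likelihood in Equation (7). It assumes that the density ratio
$q_{\mathrm{unlabel}}(x) / q_{\mathrm{label}}(x)$ is already available for every
observation; the function name, the data layout, and the toy numbers are hypothetical.

```python
import math

def weighted_loglik(w, labeled, unlabeled, gamma1, gamma2):
    """Weighted log-likelihood of Equation (7).

    labeled:   list of (x, y, r), r = q_unlabel(x) / q_label(x)
    unlabeled: list of (x, t, r), r = q_unlabel(x) / q_label(x),
               t = current value of the latent label, in [0, 1]
    """
    def term(x, target):
        eta = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
        return target * eta - math.log1p(math.exp(eta))

    # Labeled part: weight (q_unlabel / q_label) ** gamma1.
    first = sum(r ** gamma1 * term(x, y) for x, y, r in labeled)
    # Unlabeled part: weight (q_label / q_unlabel) ** gamma2.
    second = sum((1.0 / r) ** gamma2 * term(x, t) for x, t, r in unlabeled)
    return first + second

# Toy inputs (hypothetical values, p = 2).
labeled = [((0.5, -1.2), 0, 0.8), ((1.5, 0.3), 1, 1.6), ((-0.7, 0.8), 0, 0.4)]
unlabeled = [((1.0, 1.0), 0.7, 2.0), ((-1.0, 0.2), 0.2, 0.5)]
w = [-0.3, 0.9, 0.4]

print(weighted_loglik(w, labeled, unlabeled, 0.0, 0.0))  # reduces to Equation (6)
print(weighted_loglik(w, labeled, unlabeled, 1.0, 1.0))  # fully weighted version
```

Setting both exponents to 0 removes the weights, so the first call evaluates the
unweighted log-likelihood of Equation (6) on the same data.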

Equation (7) includes the unknown ratios $q_{\mathrm{unlabel}}(x) / q_{\mathrm{label}}(x)$
and $q_{\mathrm{label}}(x) / q_{\mathrm{unlabel}}(x)$, which must be estimated. Various
researchers have addressed the problem of estimating such ratios by using methods from
statistics or machine learning (Bickel et al., 2009; Huang et al., 2007; Kanamori et
al., 2009; Sugiyama et al., 2008). In this paper, we employ the uLSIF method proposed
by Kanamori et al. (2009) to determine the values of the ratios, and we implement the
method with the source code available at
http://www.math.cm.is.nagoya-u.ac.jp/~kanamori/software/LSIF. We omit the details of
density ratio estimation by the uLSIF method, since they are not our focus in this
paper. Interested readers are referred to Kanamori et al. (2009).
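For orientation only, the following NumPy sketch shows the core least-squares step of
a uLSIF-type estimator: the ratio is modeled as a nonnegative combination of Gaussian
kernels centered at numerator samples, and the coefficients solve a ridge-regularized
linear system. It is not the code distributed at the URL above. The function names are
hypothetical, and the kernel width `sigma` and regularization `lam` are fixed by hand
here, whereas Kanamori et al. (2009) select them by cross validation.

```python
import numpy as np

def gaussian_design(x, centers, sigma):
    """Matrix of Gaussian kernel values K(x_i, c_l) with bandwidth sigma."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def ulsif_fit(x_nu, x_de, sigma=1.0, lam=0.1, n_centers=100, seed=0):
    """Fit r(x) ~ p_nu(x) / p_de(x) by regularized least squares."""
    rng = np.random.default_rng(seed)
    n_c = min(n_centers, len(x_nu))
    centers = x_nu[rng.choice(len(x_nu), size=n_c, replace=False)]
    phi_de = gaussian_design(x_de, centers, sigma)
    phi_nu = gaussian_design(x_nu, centers, sigma)
    H = phi_de.T @ phi_de / len(x_de)      # second moment under the denominator
    h = phi_nu.mean(axis=0)                # first moment under the numerator
    alpha = np.linalg.solve(H + lam * np.eye(n_c), h)
    return centers, np.maximum(alpha, 0.0)  # negative coefficients set to zero

def ulsif_predict(x, centers, alpha, sigma=1.0):
    """Evaluate the fitted ratio at the rows of x."""
    return gaussian_design(x, centers, sigma) @ alpha

# Toy example: labeled inputs from N(0, 1), unlabeled inputs from N(1, 1).
rng = np.random.default_rng(0)
x_lab = rng.normal(0.0, 1.0, size=(300, 1))
x_unl = rng.normal(1.0, 1.0, size=(300, 1))

# q_unlabel / q_label: unlabeled data in the numerator, labeled in the denominator.
centers, alpha = ulsif_fit(x_unl, x_lab)
ratio_at_labeled = ulsif_predict(x_lab, centers, alpha)
```

The resulting `ratio_at_labeled` plays the role of
$q_{\mathrm{unlabel}}(x_\alpha) / q_{\mathrm{label}}(x_\alpha)$ for the labeled points;
evaluating the same fit at the unlabeled points and taking reciprocals gives one
possible way to obtain the weights on the second term of Equation (7).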

2.3 Parameter estimation via regularization 

In estimating parameters in logistic models, the log-likelihood function often diverges
to infinity when the maximum likelihood method is applied (Konishi and Kitagawa, 2008).
Hence, the parameter vector $w$ in Equation (7) is estimated by the regularization
method, which maximizes the following regularized log-likelihood function

$$\ell_\lambda(w; \gamma_1, \gamma_2) = \ell(w; \gamma_1, \gamma_2) - \frac{n_1 \lambda}{2} w^T K w, \tag{8}$$

where $\lambda > 0$ is a regularization parameter and $K = \mathrm{diag}(0, I_p)$ is a
$(p + 1) \times (p + 1)$ matrix. Here, $I_p$ is the $p$-dimensional identity matrix.

It is not easy to optimize the parameter involved in Equation (8), since the latent
variables $t_\alpha$ ($\alpha = n_1 + 1, \ldots, n$) are unobserved. Hence, we employ an
EM-based algorithm developed by Kawano and Konishi (2011) as follows:

Step1 Estimate the parameter vector $w$ by maximizing the regularized log-likelihood
function using only the labeled data set $\{(x_\alpha, y_\alpha); \alpha = 1, \ldots, n_1\}$
with the Newton-Raphson method.

Step2 Construct a classification rule $\pi(x_\alpha; \hat{w})$.

Step3 According to the classification rule in Step2, compute the posterior probabilities
$\pi(x_\alpha; \hat{w})$ for the unlabeled data set $x_\alpha$
($\alpha = n_1 + 1, \ldots, n$). Using the posterior probabilities, estimate $t_\alpha$
by $\hat{t}_\alpha = \pi(x_\alpha; \hat{w})$.

Step4 Replace $t_\alpha$ with $\hat{t}_\alpha$ in the regularized log-likelihood
function (8), and then determine the parameter vector $w$ by maximizing the
log-likelihood function in Equation (8) with the Newton-Raphson method.

Step5 Repeat Step2 to Step4 until the condition

$$\left| \ell_\lambda(w^{(k+1)}; \gamma_1, \gamma_2) - \ell_\lambda(w^{(k)}; \gamma_1, \gamma_2) \right| < \varepsilon \tag{9}$$

is satisfied, where $w^{(k)}$ is the value of $w$ after the $k$-th EM iteration and
$\varepsilon$ is an arbitrarily small number (e.g., $10^{-5}$).

It follows from these procedures that we obtain a statistical model of the form

$$f(y \mid x; \hat{w}) = \pi(x; \hat{w})^{y} \{1 - \pi(x; \hat{w})\}^{1 - y}. \tag{10}$$

Note that the statistical model is constructed by using both labeled data and unlabeled
data.
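The five steps above can be sketched in NumPy as follows. This is a minimal
illustration under stated assumptions rather than the authors' implementation: the
function names are hypothetical, the density ratios are supplied by the caller, the
labeled-data weights are also used in Step1, and the penalty is coded as
$(\texttt{lam}/2) w^T K w$, so any sample-size factor in Equation (8) is assumed to be
absorbed into `lam`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w, X, target, weight, lam, K):
    """Weighted log-likelihood minus the ridge penalty (lam / 2) * w'Kw."""
    eta = X @ w
    fit = np.sum(weight * (target * eta - np.logaddexp(0.0, eta)))
    return fit - 0.5 * lam * (w @ K @ w)

def newton_raphson(w, X, target, weight, lam, K, n_iter=50, tol=1e-8):
    """Maximize `objective` over w by Newton-Raphson iterations."""
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (weight * (target - p)) - lam * (K @ w)
        hess = -(X.T * (weight * p * (1.0 - p))) @ X - lam * K
        step = np.linalg.solve(hess, grad)
        w = w - step
        if np.max(np.abs(step)) < tol:
            break
    return w

def fit_sslrcs(x_lab, y_lab, x_unl, r_lab, r_unl, gamma1, gamma2, lam,
               eps=1e-5, max_em=100):
    """r_lab, r_unl: ratios q_unlabel / q_label at labeled / unlabeled points."""
    X_lab = np.column_stack([np.ones(len(x_lab)), x_lab])
    X_unl = np.column_stack([np.ones(len(x_unl)), x_unl])
    K = np.eye(X_lab.shape[1])
    K[0, 0] = 0.0                              # the intercept is not penalized
    wt_lab = r_lab ** gamma1
    wt_unl = (1.0 / r_unl) ** gamma2
    # Step1: labeled data only.
    w = newton_raphson(np.zeros(X_lab.shape[1]), X_lab, y_lab, wt_lab, lam, K)
    X_all = np.vstack([X_lab, X_unl])
    weight = np.concatenate([wt_lab, wt_unl])
    previous = -np.inf
    for _ in range(max_em):
        # Step2 and Step3: posterior probabilities estimate the latent labels.
        t_hat = sigmoid(X_unl @ w)
        target = np.concatenate([y_lab, t_hat])
        # Step4: maximize the regularized log-likelihood with t replaced by t_hat.
        w = newton_raphson(w, X_all, target, weight, lam, K)
        # Step5: stop when the objective changes by less than eps.
        current = objective(w, X_all, target, weight, lam, K)
        if abs(current - previous) < eps:
            break
        previous = current
    return w

# Toy data (hypothetical): two Gaussian classes in two dimensions.
rng = np.random.default_rng(0)
y_lab = np.repeat([0.0, 1.0], 20)
x_lab = rng.normal(size=(40, 2)) + 3.0 * y_lab[:, None] - 1.5
y_unl = rng.integers(0, 2, size=200).astype(float)
x_unl = rng.normal(size=(200, 2)) + 3.0 * y_unl[:, None] - 1.5
w_hat = fit_sslrcs(x_lab, y_lab, x_unl, np.ones(40), np.ones(200),
                   gamma1=0.5, gamma2=0.5, lam=1.0)
```

In the toy call all density ratios are set to 1, which corresponds to the ordinary
setting in which labeled and unlabeled data share the same density.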

3 Model selection criterion 

The statistical model in Equation (10) contains several adjustable parameters: the two
tuning parameters $\gamma_1, \gamma_2$ in the weighted log-likelihood function and the
regularization parameter $\lambda$. Regarding the selection of these parameters as the
selection of candidate models, we introduce a model selection criterion from an
information-theoretic approach.

Akaike (1974) introduced the Akaike information criterion (AIC) for evaluating
statistical models estimated by the maximum likelihood method. Although the AIC is
widely used in many fields of research, it is difficult to apply to models estimated
by procedures other than maximum likelihood. By extending the AIC, Konishi and
Kitagawa (1996) derived an information criterion that can evaluate models given by
M-estimators, including the regularization method. Using this result, we present a
generalized information criterion (GIC) for evaluating our proposed semi-supervised
logistic models estimated by the regularization method. The model selection criterion
is given by

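$$\mathrm{GIC} = -2 \sum_{\alpha=1}^{n_1} \left( \frac{q_{\mathrm{unlabel}}(x_\alpha)}{q_{\mathrm{label}}(x_\alpha)} \right)^{\gamma_1} \log f(y_\alpha \mid x_\alpha; \hat{w}) + 2\, \mathrm{tr}\left\{ Q(\hat{w}) R^{-1}(\hat{w}) \right\}, \tag{11}$$

where the matrices $Q(\hat{w})$ and $R(\hat{w})$ are

$$Q(\hat{w}) = \frac{1}{n_1} \left\{ X^T W^2 \Lambda^2 X - \lambda K \hat{w} 1_{n_1}^T W \Lambda X \right\}, \tag{12}$$

$$R(\hat{w}) = \frac{1}{n_1} X^T \Pi W (I_{n_1} - \Pi) X + \lambda K. \tag{13}$$

Here, $1_{n_1}$ is an $n_1$-dimensional vector whose elements are all 1, and $I_{n_1}$
is the $n_1$-dimensional identity matrix. Also, $X$, $W$, $\Lambda$, and $\Pi$ are,
respectively, given by

$$X = (x_1^*, \ldots, x_{n_1}^*)^T,$$

$$W = \mathrm{diag}\left[ \left( \frac{q_{\mathrm{unlabel}}(x_1)}{q_{\mathrm{label}}(x_1)} \right)^{\gamma_1}, \ldots, \left( \frac{q_{\mathrm{unlabel}}(x_{n_1})}{q_{\mathrm{label}}(x_{n_1})} \right)^{\gamma_1} \right],$$

$$\Lambda = \mathrm{diag}\left[ y_1 - \pi(x_1; \hat{w}), \ldots, y_{n_1} - \pi(x_{n_1}; \hat{w}) \right],$$

$$\Pi = \mathrm{diag}\left[ \pi(x_1; \hat{w}), \ldots, \pi(x_{n_1}; \hat{w}) \right].$$

We choose the adjustable parameters as the minimizer of the GIC in Equation (11).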
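As an illustration of how Equations (11) to (13) might be evaluated for one candidate
setting, the following NumPy sketch computes the GIC from the labeled data, their
density ratios, and a fitted parameter vector. The function name and the toy inputs
are hypothetical; `lam` denotes the $\lambda$ that appears in Equations (12) and (13),
and in practice `w_hat` would be the estimate obtained for the same
$(\gamma_1, \gamma_2, \lambda)$.

```python
import numpy as np

def gic(x_lab, y_lab, ratio_lab, w_hat, gamma1, lam):
    """GIC of Equation (11), using the matrices in Equations (12) and (13)."""
    n1 = len(y_lab)
    X = np.column_stack([np.ones(n1), x_lab])   # rows are (x*_alpha)'
    K = np.eye(X.shape[1])
    K[0, 0] = 0.0
    pi = 1.0 / (1.0 + np.exp(-(X @ w_hat)))     # diagonal of Pi
    wgt = ratio_lab ** gamma1                    # diagonal of W
    resid = y_lab - pi                           # diagonal of Lambda
    # Q = (1 / n1) { X' W^2 Lambda^2 X - lam K w 1' W Lambda X }
    Q = (X.T * (wgt ** 2 * resid ** 2)) @ X
    Q -= lam * np.outer(K @ w_hat, (wgt * resid) @ X)
    Q /= n1
    # R = (1 / n1) X' Pi W (I - Pi) X + lam K
    R = (X.T * (pi * wgt * (1.0 - pi))) @ X / n1 + lam * K
    loglik = np.sum(wgt * (y_lab * np.log(pi) + (1.0 - y_lab) * np.log(1.0 - pi)))
    return -2.0 * loglik + 2.0 * np.trace(Q @ np.linalg.inv(R))

# Toy inputs (hypothetical values).
rng = np.random.default_rng(1)
y_toy = np.repeat([0.0, 1.0], 15)
x_toy = rng.normal(size=(30, 2)) + 1.0 * y_toy[:, None] - 0.5
ratio_toy = rng.uniform(0.5, 2.0, size=30)
w_toy = np.array([0.1, 0.8, 0.8])

score = gic(x_toy, y_toy, ratio_toy, w_toy, gamma1=0.5, lam=0.01)
```

Selecting the tuning parameters then amounts to refitting the model over a grid of
$(\gamma_1, \gamma_2, \lambda)$ values and keeping the combination with the smallest
score.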



4 Numerical study 

We studied some numerical examples to show the efficiency of our proposed modeling 
strategy. Two types of Monte Carlo simulations and benchmark data analysis are given 
to illustrate the proposed semi-supervised logistic discrimination. 

4.1 Simulation 1 

We investigated the effectiveness of the proposed modeling procedures through Monte
Carlo simulation. In this simulation study, we generated data sets
$\{(x_{1\alpha}, x_{2\alpha}, y_\alpha); \alpha = 1, \ldots, n\}$ as labeled data and
$\{(x_{1\alpha}, x_{2\alpha}); \alpha = 1, \ldots, 500\}$ as unlabeled data. For the
labeled data, $(x_{1\alpha}, x_{2\alpha})$ were generated from the normal distribution
$N((-0.9, 1 - \sin(\sin(0.9^2 \pi)))^T, \mathrm{diag}(0.0015, 2))$, and $y_\alpha$ was
generated according to the conditional probability

$$\Pr(Y = 1 \mid x_1, x_2) = \frac{1}{1 + \exp\{-\sin(2\pi x_2) - x_2 + 1\}}. \tag{14}$$

Meanwhile, the unlabeled data $(x_{1\alpha}, x_{2\alpha})$ were obtained from the
normal distribution $N((-0.4, 1 - \sin(\sin(0.4^2 \pi)))^T, \mathrm{diag}(0.05, 1))$.
The test data set $\{(x_{1\alpha}, x_{2\alpha}, y_\alpha); \alpha = 1, \ldots, 1000\}$
was generated as follows. First, $(x_{1\alpha}, x_{2\alpha})$ were drawn from a mixture
of the labeled and unlabeled distributions with equal mixing rates (that is, 0.5).
Second, for each $(x_{1\alpha}, x_{2\alpha})$, $y_\alpha$ was obtained according to the
conditional probability in Equation (14). The labeled data sizes ($n$) were 25, 50,
100, 150, 200, and 250.

Table 1: Comparisons of prediction error rates (%) for several numbers of labeled data
points.

Method \ # of labeled data     25     50    100    150    200    250
SSLRCS                       33.3   33.3   33.9   34.8   35.5   35.0
LSSLR                        34.3   34.4   34.2   35.3   35.9   35.6
SLR                          35.6   34.3   34.3   35.2   35.8   35.6

We fitted our semi-supervised logistic regression model to the data sets. Note that
the density ratio estimation procedure by the uLSIF method described in Section 2.2 was
not performed in these simulation trials, since the density ratio can be calculated
exactly. The simulation results were obtained by averaging over 50 repeated Monte Carlo
trials. The tuning parameters in our models were selected by using the GIC in Equation
(11). The values of the tuning parameters were $\gamma_1 = 0.10$, $\gamma_2 = 0.610$,
and $\lambda = 10^{-2.20}$, which were averaged over 50 repetitions. The results are
summarized in Table 1.
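Because both sampling distributions in this simulation are normal with diagonal
covariance matrices, the density ratio has a closed form. The following Python sketch
shows one way to evaluate it. The function names are hypothetical, the parameter values
are read from the description of Simulation 1, and the entries of
$\mathrm{diag}(\cdot)$ are assumed to be variances.

```python
import math

def log_normal_diag(x, mean, var):
    """Log density of a normal distribution with covariance diag(var)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def log_density_ratio(x, mean_unl, var_unl, mean_lab, var_lab):
    """log{ q_unlabel(x) / q_label(x) } for two diagonal-covariance normals."""
    return (log_normal_diag(x, mean_unl, var_unl)
            - log_normal_diag(x, mean_lab, var_lab))

def density_ratio(x, mean_unl, var_unl, mean_lab, var_lab):
    """Exact ratio; math.exp may overflow far from the labeled-data mass."""
    return math.exp(log_density_ratio(x, mean_unl, var_unl, mean_lab, var_lab))

# Parameters of Simulation 1 as described in the text.
mean_lab = (-0.9, 1.0 - math.sin(math.sin(0.9 ** 2 * math.pi)))
var_lab = (0.0015, 2.0)
mean_unl = (-0.4, 1.0 - math.sin(math.sin(0.4 ** 2 * math.pi)))
var_unl = (0.05, 1.0)

ratio_at_lab_mean = density_ratio(mean_lab, mean_unl, var_unl, mean_lab, var_lab)
ratio_at_unl_mean = density_ratio(mean_unl, mean_unl, var_unl, mean_lab, var_lab)
```

Working with `log_density_ratio` and exponentiating only after multiplying by
$\gamma_1$ or $-\gamma_2$ is one way to keep the weights numerically stable.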

We compared the performance of the proposed semi-supervised methodology (SSLRCS:
semi-supervised logistic regression under covariate shift) with that of the
semi-supervised method proposed by Amini and Gallinari (2002) (LSSLR: linear
semi-supervised logistic regression), which was developed under the condition that the
density functions for labeled and unlabeled data are the same, and with supervised
linear logistic discriminant analysis (SLR: supervised logistic regression). Note that
the SLR is constructed by using only labeled data. The semi-supervised and supervised
logistic modeling strategies were applied to the data sets. Since the LSSLR and the SLR
each include a tuning parameter, the parameter is determined by the GIC, where the GIC
for LSSLR is obtained by setting
$q_{\mathrm{unlabel}}(x_\alpha) / q_{\mathrm{label}}(x_\alpha) = 1$
($\alpha = 1, \ldots, n_1$) in Equation (11) and that for SLR is given by Ando et al.
(2008). It may be seen from Table 1 that SSLRCS is superior to the other methods (LSSLR
and SLR) in all cases in the sense that the proposed method gives smaller prediction
error rates.

4.2 Simulation 2 

We simulated three data sets given in Chakraborty (2011) to examine the performances 
of our proposed modeling strategy. For each of the simulation cases, we generated 100 
data points in the labeled data set, 1000 data points in the unlabeled data set, and 1000 
data points in the test data set. Using the data sets, we constructed the SSLRCS, the 
LSSLR, and the SLR. We repeated the procedure 50 times. Our simulation settings are 
given as follows (for details, see Chakraborty (2011, p. 76)):

• Case 1: In the labeled data set, generate $x = (x_1, x_2)^T$ given by
$x_i \sim N(2, 1)$ ($i = 1, 2$) for Class 1 and $x_i \sim N(-2, 1)$ ($i = 1, 2$) for
Class 2. In the unlabeled data set, $x_i \sim N(2, 2)$ ($i = 1, 2$) for Class 1 and
$x_i \sim N(-2, 2)$ ($i = 1, 2$) for Class 2. In the test data set,
$x_i \sim 0.5 N(2, 1) + 0.5 N(2, 2)$ ($i = 1, 2$) for Class 1 and
$x_i \sim 0.5 N(-2, 1) + 0.5 N(-2, 2)$ ($i = 1, 2$) for Class 2.

• Case 2: Generate $x = (x_1, \ldots, x_{10})^T$ given by $x_i \sim N(1, 3)$
($i = 1, \ldots, 10$) for Class 1 and $x_i \sim N(-1, 3)$ ($i = 1, \ldots, 10$) for
Class 2.

• Case 3: Generate $x = (x_1, x_2)^T$ given by $x_i \sim N(5, 2)$ ($i = 1, 2$) for
Class 1 and $x_i \sim N(8, 2)$ ($i = 1, 2$) for Class 2 in the labeled data set. In the
unlabeled data set, $x_i \sim N(6, 2)$ ($i = 1, 2$) for Class 1 and $x_i \sim N(9, 2)$
($i = 1, 2$) for Class 2. In the test data set, $x_i \sim 0.5 N(5, 2) + 0.5 N(6, 2)$
($i = 1, 2$) for Class 1 and $x_i \sim 0.5 N(8, 2) + 0.5 N(9, 2)$ ($i = 1, 2$) for
Class 2.
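The settings listed above can be turned into a small data generator. The following
Python sketch does so for Case 1 using only the standard library. It rests on several
assumptions that are not stated in the text: the second argument of $N(\cdot, \cdot)$
is read as a variance, the two classes are drawn with equal probability, and the
mixture component is chosen independently for each coordinate. The function names are
hypothetical.

```python
import math
import random

def draw_coordinate(label, kind, rng):
    """Draw one coordinate x_i for Case 1 given the class label and data set kind."""
    mean = 2.0 if label == 1 else -2.0
    if kind == "labeled":
        var = 1.0
    elif kind == "unlabeled":
        var = 2.0
    else:  # "test": equal-weight mixture of the labeled and unlabeled laws
        var = rng.choice([1.0, 2.0])
    return rng.gauss(mean, math.sqrt(var))

def draw_case1(n, kind, rng):
    """Return n tuples (x1, x2, label) for the requested data set kind."""
    data = []
    for _ in range(n):
        label = rng.choice([1, 2])  # assumption: equal class proportions
        data.append((draw_coordinate(label, kind, rng),
                     draw_coordinate(label, kind, rng),
                     label))
    return data

rng = random.Random(0)
labeled = draw_case1(100, "labeled", rng)
unlabeled = draw_case1(1000, "unlabeled", rng)  # labels would be discarded in use
test = draw_case1(1000, "test", rng)
```

Cases 2 and 3 follow the same pattern with different means, variances, and dimensions.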
Table 2: Comparisons of prediction error rates (%) for several cases.

Method \ Data sets   Case 1   Case 2   Case 3
SSLRCS                 1.28     3.65     9.72
LSSLR                  1.36     4.19     11.6
SLR                    1.43     5.05     11.7

The results of the simulation studies are given in Table 2. The optimal tuning
parameters selected by the GIC in our models were $\gamma_1 = 1.00$,
$\gamma_2 = 0.102$, and $\lambda = 10^{-2.50}$ for Case 1; $\gamma_1 = 1.00$,
$\gamma_2 = 0.106$, and $\lambda = 10^{-1.98}$ for Case 2; and $\gamma_1 = 1.00$,
$\gamma_2 = 0.106$, and $\lambda = 10^{-1.98}$ for Case 3. From the simulation results,
we observe that our proposed procedure gives the smallest prediction error rates in all
cases, even though Case 2 is an ordinary setting of semi-supervised learning, i.e., the
density function for labeled data is the same as that for unlabeled data. Hence, we
conclude that our proposed method may be useful even if the densities for labeled and
unlabeled data are the same.

4.3 Benchmark data analysis 

Through the analysis of the g10 data set (Chapelle and Zien, 2005), the ionosphere data
set (Sigillito et al., 1989), and the pima data set (Ripley, 1996), we illustrate the
effectiveness of the proposed semi-supervised methodology. The g10 data set includes
550 data points with 10 predictors, and we prepared 250 training data points and 300
test data points. The ionosphere data set consists of 351 data points with 33
predictors, and we split them into 150 training data points and 201 test data points.
The pima data set, which consists of 300 training data points and 232 test data points,
is a binary classification problem with 7 predictors. In order to implement the
semi-supervised procedure, the training data points were randomly split into labeled
and unlabeled data points, where the labeled data points made up 5%, 10%, 20%, 30%,
40%, and 50% of the training data points, respectively. We repeated the random
splitting 50 times. We also compared our proposed method (SSLRCS) with the LSSLR and
the SLR, which are described in Section 4.1.

Table 3: Comparisons of prediction error rates (%) for some data sets.

Method \ % labeled      5     10     20     30     40     50
g10
  SSLRCS             3.40   3.47   3.85   4.06   4.66   5.42
  LSSLR              26.6   16.2   9.94   7.04   5.66   4.77
  SLR                26.4   16.4   9.30   6.85   5.45   4.62
Ionosphere
  SSLRCS             18.2   17.3   16.9   16.4   17.3   16.8
  LSSLR              29.0   22.8   18.9   17.4   16.2   15.4
  SLR                28.9   23.1   19.5   18.0   16.7   15.7
Pima
  SSLRCS             26.6   26.9   26.6   26.8   26.7   26.7
  LSSLR              30.1   27.0   27.0   27.0   26.9   26.7
  SLR                29.3   26.9   26.9   27.0   26.8   26.7

Table 3 shows the prediction errors for the benchmark data sets. We obtained the
following optimal values of the tuning parameters in our proposed models:
$\gamma_1 = 1.00$, $\gamma_2 = 0.154$, and $\lambda = 10^{-3.20}$ for the g10 data set;
$\gamma_1 = 0.992$, $\gamma_2 = 0.504$, and $\lambda = 10^{-2.89}$ for the ionosphere
data set; and $\gamma_1 = 1.00$, $\gamma_2 = 0.308$, and $\lambda = 10^{1.41}$ for the
pima data set, each averaged over 50 repetitions. From the results, we find that our
proposed procedure outperforms the previously proposed methods in almost all
situations, although it is unclear whether the densities for labeled and unlabeled data
are different. In particular, the proposed method seems to work well when the number of
labeled data points is small.
5 Concluding remarks

We proposed a semi-supervised logistic classification methodology for the case in which
the density functions of labeled and unlabeled data differ, using the techniques of
covariate shift adaptation and regularization. A crucial point in our semi-supervised
modeling process is the choice of the tuning parameters in our proposed models. We
introduced a model selection criterion from the viewpoint of an information-theoretic
approach in order to select the values of these parameters. Through Monte Carlo
simulations and benchmark data analysis, we showed that our modeling strategy is
effective in practical situations in that it yields relatively lower prediction errors
than previously developed methods. Our modeling procedure may be applied to the problem
of constructing a nonlinear semi-supervised discriminant model based on basis expansion,
which will be discussed in another paper.